This report provides an evaluation of the accuracy and precision of probabilistic forecasts of COVID-19 cases and deaths submitted to the US COVID-19 Forecast Hub. Some analyses include forecasts submitted in the past twelve months, starting on June 25, 2022. Others focus on evaluating “recent” forecasts, submitted only in the last 10 weeks.
In collaboration with the US Centers for Disease Control and Prevention (CDC), the COVID-19 Forecast Hub collects short-term COVID-19 forecasts from dozens of research groups around the globe. Every week we combine the most recent forecasts from each team into a single “ensemble” forecast for each of the targets. This ensemble is used as the official forecast of the CDC, typically appearing on their forecasting website on Wednesdays. You can explore the full set of models, including their forecasts for past weeks, online at the Forecast Hub interactive visualization. Other related resources include CMU Delphi’s forecast evaluation dashboard, a separate product of the Forecast Evaluation Research Collaborative, as well as the preprint Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the US.
As of February 20, 2023, we are no longer generating ensemble case forecasts, and as of March 6, 2023, we are no longer generating ensemble death forecasts. Reports up to March 13, 2023 include case and death forecast evaluations, but after that date they only include hospitalization forecast evaluations.
This report evaluates forecasts at the state and national level for newly reported hospitalizations due to COVID-19. Data from HealthData.gov are used as the ground truth for evaluating the forecasts.
As of September 28, 2021, the COVIDhub-ensemble only reports 14-day-ahead forecasts for hospitalizations. For a more complete explanation, see our blog post here.
To reduce duplication of results, the COVIDhub_CDC-ensemble and COVIDhub-ensemble are omitted from this evaluation. The COVIDhub_CDC-ensemble pulls a subset of the hospitalization forecasts from the COVIDhub-4_week_ensemble, and the COVIDhub-ensemble nearly matches the COVIDhub-4_week_ensemble and COVIDhub-trained_ensemble predictions for those targets, up to occasional small differences in the included models. As a result, the performance of the COVIDhub_CDC-ensemble and COVIDhub-ensemble models matches or nearly matches the performance of the COVIDhub-4_week_ensemble and COVIDhub-trained_ensemble on those targets. For more information about COVID-19 Forecast Hub ensemble methods, see this page.
We evaluate models based on their adjusted relative weighted interval score (WIS, a measure of distributional accuracy) and adjusted relative mean absolute error (MAE). Scores are aggregated separately for the most recent 10 weeks and for 52 historical weeks. To account for the variation in difficulty of forecasting different weeks and locations, a pairwise approach was used to calculate the adjusted relative WIS and MAE. Models with relative scores lower than 1 have been more accurate than the baseline on average, whereas relative scores greater than 1 indicate less accuracy than the baseline on average.
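For reference, the WIS of a forecast $F$ for an observed value $y$, given a predictive median $m$ and $K$ central prediction intervals with nominal levels $1 - \alpha_k$, is

$$
\mathrm{WIS}(F, y) = \frac{1}{K + 1/2}\left(\frac{1}{2}\,\lvert y - m \rvert + \sum_{k=1}^{K} \frac{\alpha_k}{2}\, \mathrm{IS}_{\alpha_k}(F, y)\right),
\qquad
\mathrm{IS}_{\alpha}(F, y) = (u - l) + \frac{2}{\alpha}(l - y)\,\mathbf{1}(y < l) + \frac{2}{\alpha}(y - u)\,\mathbf{1}(y > u),
$$

where $l$ and $u$ are the endpoints of the central $(1 - \alpha)$ interval. The pairwise approach computes, for each pair of models $i$ and $j$, the ratio $\theta_{ij}$ of their mean WIS over the forecast tasks both models submitted; the relative skill of model $i$ is the geometric mean $\theta_i = \big(\prod_{j} \theta_{ij}\big)^{1/M}$ over all $M$ models, and the relative WIS reported here is $\theta_i / \theta_{\text{baseline}}$. The relative MAE is computed analogously, with the absolute error of the point forecast in place of WIS. This is a sketch following the definitions in the evaluation preprint referenced above; the Hub's adjustment may differ in small details.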
We generated scores in two ways: with the raw counts and with log-transformed counts. It has been argued that log-transforming counts prior to scoring yields epidemiologically meaningful and easily interpretable results, while also reducing the impact of high-count locations on aggregated scores.
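To make the scoring concrete, the sketch below computes the WIS for a single quantile-format forecast, with an option to apply a log(x + 1) transform to the forecast and observation before scoring. The function, quantile levels, and example values are illustrative assumptions, not the Hub's actual scoring code.

```python
import numpy as np

def interval_score(lower, upper, y, alpha):
    """Interval score of the central (1 - alpha) prediction interval at observation y."""
    return (upper - lower) \
        + (2.0 / alpha) * max(lower - y, 0.0) \
        + (2.0 / alpha) * max(y - upper, 0.0)

def weighted_interval_score(levels, values, y, log_scale=False):
    """WIS from a symmetric set of predictive quantiles (must include the 0.5 quantile).

    If log_scale is True, forecasts and the observation are transformed with
    log(x + 1) before scoring, as in the log-scale evaluations below.
    """
    q = {round(lev, 3): float(val) for lev, val in zip(levels, values)}
    if log_scale:
        q = {lev: float(np.log1p(val)) for lev, val in q.items()}
        y = float(np.log1p(y))
    alphas = sorted(round(2 * lev, 3) for lev in q if lev < 0.5)  # one alpha per central interval
    wis = 0.5 * abs(y - q[0.5])                                   # penalty on the predictive median
    for alpha in alphas:
        lower, upper = q[round(alpha / 2, 3)], q[round(1 - alpha / 2, 3)]
        wis += (alpha / 2) * interval_score(lower, upper, y, alpha)
    return wis / (len(alphas) + 0.5)

# Hypothetical example: a forecast with 50% and 95% intervals plus the median.
levels = [0.025, 0.25, 0.5, 0.75, 0.975]
values = [80, 110, 130, 155, 210]
print(weighted_interval_score(levels, values, y=142))                  # raw counts
print(weighted_interval_score(levels, values, y=142, log_scale=True))  # log scale
```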
These evaluations are based on raw counts.
The first and second tables evaluate recent/historical forecast models based on their WIS and MAE by horizon.
The third and fourth tables evaluate recent/historical forecast models based on their prediction interval coverage at the 50% and 95% levels by horizon.
Scores are aggregated separately for the most recent 10 weeks and for 52 historical weeks. Since hospitalization forecasts are made at the daily timescale, computations for a given “week” are computed by averaging scores for the daily forecasts from Tuesday through Monday.
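As an illustration of this weekly aggregation, the following sketch averages daily scores into Tuesday-through-Monday blocks, labeling each block by the Monday that closes it. The data frame layout and column names are assumptions for illustration, not the Hub's actual pipeline.

```python
import pandas as pd

def weekly_mean_scores(daily_scores: pd.DataFrame) -> pd.DataFrame:
    """Average daily scores into Tuesday-through-Monday weeks.

    Assumes columns 'model', 'target_end_date' (daily dates), and 'wis'.
    """
    df = daily_scores.copy()
    df["target_end_date"] = pd.to_datetime(df["target_end_date"])
    # Weekly periods anchored to end on Monday ('W-MON') run Tuesday..Monday;
    # each week is labeled by the Monday on which it ends.
    df["week_ending"] = df["target_end_date"].dt.to_period("W-MON").dt.end_time.dt.normalize()
    return df.groupby(["model", "week_ending"], as_index=False)["wis"].mean()
```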
Inclusion criteria for each column are detailed below the table.
Different inclusion criteria were applied to calculate each column in this table. The table only includes forecasts for the last 10 weeks, since April 15, 2023. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination; a sketch of this inclusion rule appears after the column descriptions below. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled “# recent forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 10-week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 10 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 10 week period by horizon.
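A rough sketch of how the 50% inclusion rule could be applied, counting one forecast per (location, target, forecast date) combination and keeping models that submitted at least half of the possible combinations. The column names, the way the set of possible forecasts is approximated, and the helper itself are assumptions for illustration; the Hub's actual implementation may differ.

```python
import pandas as pd

def eligible_models(forecasts: pd.DataFrame, start="2023-04-15", min_frac=0.5):
    """Models that submitted at least `min_frac` of possible forecasts since `start`.

    Assumes one row per submitted forecast with columns 'model', 'location',
    'target', and 'forecast_date'.
    """
    recent = forecasts[pd.to_datetime(forecasts["forecast_date"]) >= start]
    # One "forecast" is a unique (location, target, forecast date) combination;
    # approximate the set of possible combinations by those any model submitted.
    n_possible = len(recent[["location", "target", "forecast_date"]].drop_duplicates())
    n_by_model = (
        recent.drop_duplicates(["model", "location", "target", "forecast_date"])
              .groupby("model")
              .size()
    )
    return sorted(n_by_model[n_by_model >= min_frac * n_possible].index)
```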
Different inclusion criteria were applied to calculate each column in this table. The table only includes forecasts for the last 52 weeks, since June 25, 2022. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled “# historical forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 52-week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 52 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 52 week period by horizon.
This table only includes forecasts for the last 10 weeks, since April 15, 2023. For inclusion in this table, models must have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by each model’s 95% PI coverage aggregated across horizons, with the most accurate models at the top.
This table only includes forecasts for the last 52 weeks, since June 25, 2022. For inclusion in this table, models must have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by each model’s 95% PI coverage aggregated across horizons, with the most accurate models at the top.
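For reference, the coverage values in these tables are the share of forecasts whose central prediction interval contained the eventually observed value. A minimal sketch, assuming the forecasts have been pivoted to one row per forecast task with hypothetical quantile and truth columns:

```python
import pandas as pd

def empirical_coverage(forecasts: pd.DataFrame, level: float) -> float:
    """Share of forecasts whose central `level` prediction interval covered the truth.

    Assumes one row per forecast task, with quantile columns such as 'q0.025'
    and 'q0.975' for the 95% interval and 'observed' for the reported value.
    """
    alpha = 1.0 - level
    lower = forecasts[f"q{alpha / 2:g}"]
    upper = forecasts[f"q{1 - alpha / 2:g}"]
    return ((forecasts["observed"] >= lower) & (forecasts["observed"] <= upper)).mean()

# coverage_50 = empirical_coverage(wide_forecasts, level=0.50)  # ideally about 0.50
# coverage_95 = empirical_coverage(wide_forecasts, level=0.95)  # ideally about 0.95
```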
The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 10 weeks, from models that submitted at least 50% of forecasts during this time. These are the same inclusion criteria applied to the WIS scores in the recent evaluation period.
The sum of the bars adds up to the WIS score. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons. The y axis is truncated at the 95th percentile of the sum of the bars across models, rounded up to the nearest 10.
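If, as in the standard WIS decomposition, the stacked components represent dispersion, overprediction, and underprediction, they correspond to splitting the weighted interval scores term by term:

$$
\mathrm{WIS}(F, y) = \text{dispersion} + \text{overprediction} + \text{underprediction},
$$

where dispersion aggregates the weighted interval widths $(u_k - l_k)$, overprediction aggregates the weighted penalties $\tfrac{2}{\alpha_k}(l_k - y)$ incurred when the observation falls below an interval, and underprediction aggregates the weighted penalties $\tfrac{2}{\alpha_k}(y - u_k)$ incurred when the observation falls above an interval.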
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states. The models in the legend with a dot and line have scores for every week. The models with just a line are missing scores for at least one week.
For the first two figures, WIS is used as the metric, with the y axis truncated at the 97.5th percentile of the average WIS. The first figure shows the mean WIS across all 50 states for submission weeks beginning June 25, 2022 at a 1 week horizon. The second figure shows the mean WIS aggregated across locations, but at a 4 week horizon. Since hospitalization forecasts are made at the daily timescale, computations for a given “week” are computed by averaging scores for the daily forecasts from Tuesday through Monday.
In this figure, the dotted black line represents the average 1 week ahead error across all models. There is often larger error for the 4 week horizon compared to the 1 week horizon.
We would expect a well-calibrated model to have a value of 95% in this plot.
We would expect a well-calibrated model to have a value of 95% in this plot. There is typically larger error for the 4 week horizon compared to the 1 week horizon.
The figures below show recent model performance stratified by location. We only included forecasts for the last 10 weeks. Models were included if they had submitted forecasts for all 4 horizons and submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.
The color scheme shows the WIS score relative to the baseline, across all horizons. The only locations evaluated are the 50 states, selected other jurisdictions, and the national level forecast. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
This figure shows the number of daily incident COVID-19 hospitalizations reported in the US. The vertical blue line indicates the beginning of the “recent” model evaluation period. The vertical green line indicates the beginning of the “historical” model evaluation period.
These evaluations are based on log-transformed counts, as recommended by Bosse et al. (2023).
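Concretely, assuming the natural-log variant discussed by Bosse et al., each predictive quantile and the corresponding observation are transformed before scoring:

$$
\mathrm{WIS}^{\log}(F, y) = \mathrm{WIS}\big(\log(F + 1),\ \log(y + 1)\big),
$$

where $\log(F + 1)$ denotes applying the transform to every predictive quantile, and adding one keeps zero counts finite. The adjusted relative WIS and MAE are then computed from these transformed scores exactly as in the raw-count evaluation.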
The first and second tables evaluate recent/historical forecast models based on their WIS and MAE by horizon, based on log-transformed counts.
Scores are aggregated separately for the most recent 10 weeks and for 52 historical weeks. Since hospitalization forecasts are made at the daily timescale, computations for a given “week” are computed by averaging scores for the daily forecasts from Tuesday through Monday.
Inclusion criteria for each column are detailed below the table.
Different inclusion criteria were applied to calculate each column in this table. The table only includes forecasts for the last 10 weeks, since April 15, 2023. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled “# recent forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 10-week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 10 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 10 week period by horizon.
Different inclusion criteria were applied to calculate each column in this table. The table only includes forecasts for the last 52 weeks, since June 25, 2022. The models included have submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. The data are initially ordered by model based on their relative WIS score aggregated across horizons, with the most accurate models at the top.
The column titled “# historical forecasts” lists the number of forecasts a team has submitted with a target end date in the most recent 52-week period.
Columns 3 through 6 calculate the adjusted relative WIS over the most recent 52 week period by horizon.
Columns 7 through 10 calculate the adjusted relative MAE over the most recent 52 week period by horizon.
The data in this graph have been aggregated over all locations and submission weeks. We only included forecasts for the last 10 weeks, from models that submitted at least 50% of forecasts during this time. These are the same inclusion criteria applied to the WIS scores in the recent evaluation period.
The sum of the bars adds up to the WIS score. Of note, these values may not exactly match the relative WIS scores shown in the leaderboard table because they are not adjusted for missing weeks or locations. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons. The y axis is truncated at the 95th percentile of the sum of the bars across models, rounded up to the nearest 10.
In the following figures, we have evaluated models across multiple forecasting weeks. Points included in this comparison are for models that have submitted probabilistic forecasts for all 50 states. The models in the legend with a dot and line have scores for every week. The models with just a line are missing scores for at least one week.
The first figure shows the mean WIS across all 50 states for submission weeks beginning June 25, 2022 at a 1 week horizon. The second figure shows the mean WIS aggregated across locations, but at a 4 week horizon. Since hospitalization forecasts are made at the daily timescale, computations for a given “week” are computed by averaging scores for the daily forecasts from Tuesday through Monday.
In this figure, the dotted black line represents the average 1 week ahead error across all models. There is often larger error for the 4 week horizon compared to the 1 week horizon.
The figures below show recent model performance stratified by location. We only included forecasts for the last 10 weeks. Models were included if they had submitted forecasts for all 4 horizons and submitted at least 50% of forecasts during this time, where one forecast is a location, target, forecast date combination. Locations are sorted by cumulative hospitalization counts.
The color scheme shows the WIS score relative to the baseline, across all horizons. The only locations evaluated are the 50 states, selected other jurisdictions, and the national level forecast. The data are ordered on the x axis based on their relative WIS score shown in the accuracy table, aggregated across horizons.
This figure shows the number of daily incident COVID-19 hospitalizations reported in the US. The vertical blue line indicates the beginning of the “recent” model evaluation period. The vertical green line indicates the beginning of the “historical” model evaluation period.